Homework 2

In this homework, the aim is to determine whether a molecule is musk or non-musk. The dataset is all-integer and contains 476 instances with 166 features. It covers 47 musk and 45 non-musk molecules, each represented by a bag of its conformations. If any instance in a bag is labeled “1”, the molecule is defined as musk; if all instances are labeled “0”, it is non-musk.

library(data.table)
require(arules)
library(dplyr)
library(tidyverse)
library(ggbiplot)
library(ggplot2)
library(MASS)
library(jpeg)
library(imager)
library(magick)

##################TASK 1#########################

musk=fread("Musk1.csv",stringsAsFactors = FALSE)
head(musk[,1:16])
##    V1 V2 V3   V4   V5  V6   V7 V8 V9  V10 V11 V12  V13  V14  V15  V16
## 1:  1  1 42 -198 -109 -75 -117 11 23  -88 -28 -27 -232 -212  -66 -286
## 2:  1  1 42 -191 -142 -65 -117 55 49 -170 -45   5 -325 -115 -107 -281
## 3:  1  1 42 -191 -142 -75 -117 11 49 -161 -45 -28 -278 -115  -67 -274
## 4:  1  1 42 -198 -110 -65 -117 55 23  -95 -28   5 -301 -212 -107 -280
## 5:  1  2 42 -198 -102 -75 -117 10 24  -87 -28 -28 -233 -212  -67 -286
## 6:  1  2 42 -191 -142 -65 -117 55 49 -170 -45   6 -324 -114 -106 -280

For the PCA, the prcomp function is used with the data centered and scaled, so that features with large magnitudes do not dominate features with small magnitudes.
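The effect of centering and scaling can be illustrated with a minimal sketch. The data frame `toy` and its columns `small` and `large` are hypothetical stand-ins, not the musk features; only `prcomp` and its `center` and `scale.` arguments come from base R.

```r
# Toy data (hypothetical, not the Musk data): two features on very different scales
set.seed(1)
toy <- data.frame(
  small = rnorm(100, mean = 0, sd = 1),
  large = rnorm(100, mean = 0, sd = 1000)
)

# Without scaling, the large-magnitude feature dominates the first component
pca_raw <- prcomp(toy, center = TRUE, scale. = FALSE)

# With centering and scaling, both features contribute on an equal footing
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(prop_raw, 4))
print(round(prop_scaled, 4))
```

In the unscaled fit almost all variance falls on the first component simply because `large` has the bigger units, which is the bias that scaling is meant to remove.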

After that, multidimensional scaling (MDS) is applied to distance matrices of the musk data computed with different distance metrics. Shepard plots show the goodness of fit of each MDS solution. As can be seen from the figures, the Euclidean and Minkowski distances are the better choices for building the distance matrix. In these cases the red points are more scattered but remain close to each other, whereas the blue points are more apparent in the middle of the graph.

## initial  value 3.984595 
## final  value 3.984595 
## converged
## initial  value 0.736520 
## final  value 0.736513 
## converged
## initial  value 0.736520 
## final  value 0.736513 
## converged
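The "initial value / final value / converged" lines above have the form printed by `isoMDS` from the MASS package. Assuming that function was used, the MDS step with several distance metrics might look like the sketch below. The matrix `x` is random toy data, and the names `d_euc`, `d_man`, `d_min`, `fit_euc`, `fit_man`, and `sh` are illustrative only.

```r
library(MASS)  # provides isoMDS() and Shepard()

set.seed(1)
# Toy feature matrix (hypothetical stand-in for the scaled musk features)
x <- scale(matrix(rnorm(30 * 5), nrow = 30, ncol = 5))

# Distance matrices under different metrics
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")
d_min <- dist(x, method = "minkowski", p = 2)  # p = 2 coincides with Euclidean

# Non-metric MDS into two dimensions; isoMDS reports stress in percent
fit_euc <- isoMDS(d_euc, k = 2, trace = FALSE)
fit_man <- isoMDS(d_man, k = 2, trace = FALSE)
print(c(euclidean = fit_euc$stress, manhattan = fit_man$stress))

# Shepard data: original distances (x), configuration distances (y),
# and the fitted monotone regression values (yf)
sh <- Shepard(d_euc, fit_euc$points)
str(sh)
```

A Shepard plot can then be drawn with `plot(sh$x, sh$y)` followed by `lines(sh$x, sh$yf, type = "S")`. Because Minkowski distance with `p = 2` equals Euclidean distance, two of the three runs giving identical values is what one would expect if the default power was used.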

In the second part of the first task, each bag is reduced to a single vector by taking the mean of its members. With this reduction the PCA improves: the cumulative proportion of variance approaches 1 more rapidly. The MDS values also improve in two of the three runs, with the final value falling from about 0.74 to about 0.07.

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     6.6528 5.6257 4.6908 4.2063 3.10049 2.75154 2.24499
## Proportion of Variance 0.2666 0.1907 0.1326 0.1066 0.05791 0.04561 0.03036
## Cumulative Proportion  0.2666 0.4573 0.5898 0.6964 0.75433 0.79993 0.83030
##                            PC8     PC9    PC10    PC11    PC12   PC13
## Standard deviation     1.88542 1.80613 1.60162 1.41639 1.29688 1.2561
## Proportion of Variance 0.02141 0.01965 0.01545 0.01209 0.01013 0.0095
## Cumulative Proportion  0.85171 0.87136 0.88681 0.89890 0.90903 0.9185
##                           PC14    PC15    PC16    PC17    PC18    PC19
## Standard deviation     1.23131 1.05946 1.02942 0.93512 0.91230 0.86691
## Proportion of Variance 0.00913 0.00676 0.00638 0.00527 0.00501 0.00453
## Cumulative Proportion  0.92767 0.93443 0.94081 0.94608 0.95110 0.95562
##                           PC20    PC21    PC22   PC23    PC24    PC25
## Standard deviation     0.79297 0.78039 0.74704 0.6937 0.65812 0.63570
## Proportion of Variance 0.00379 0.00367 0.00336 0.0029 0.00261 0.00243
## Cumulative Proportion  0.95941 0.96308 0.96644 0.9693 0.97195 0.97438
##                           PC26    PC27    PC28   PC29    PC30    PC31
## Standard deviation     0.62054 0.56340 0.53876 0.5149 0.50515 0.48016
## Proportion of Variance 0.00232 0.00191 0.00175 0.0016 0.00154 0.00139
## Cumulative Proportion  0.97670 0.97862 0.98036 0.9820 0.98350 0.98489
##                           PC32    PC33    PC34    PC35    PC36    PC37
## Standard deviation     0.46820 0.44470 0.42894 0.40874 0.39965 0.38054
## Proportion of Variance 0.00132 0.00119 0.00111 0.00101 0.00096 0.00087
## Cumulative Proportion  0.98621 0.98740 0.98851 0.98951 0.99048 0.99135
##                          PC38    PC39    PC40    PC41    PC42    PC43
## Standard deviation     0.3643 0.36024 0.32250 0.31709 0.29101 0.28987
## Proportion of Variance 0.0008 0.00078 0.00063 0.00061 0.00051 0.00051
## Cumulative Proportion  0.9921 0.99293 0.99356 0.99416 0.99467 0.99518
##                           PC44   PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.27843 0.2579 0.25075 0.22941 0.21491 0.21031
## Proportion of Variance 0.00047 0.0004 0.00038 0.00032 0.00028 0.00027
## Cumulative Proportion  0.99565 0.9960 0.99643 0.99674 0.99702 0.99729
##                           PC50    PC51    PC52   PC53    PC54    PC55
## Standard deviation     0.19940 0.19567 0.18698 0.1828 0.17357 0.16440
## Proportion of Variance 0.00024 0.00023 0.00021 0.0002 0.00018 0.00016
## Cumulative Proportion  0.99753 0.99776 0.99797 0.9982 0.99835 0.99851
##                           PC56    PC57    PC58    PC59   PC60    PC61
## Standard deviation     0.15556 0.15071 0.14823 0.13348 0.1277 0.12491
## Proportion of Variance 0.00015 0.00014 0.00013 0.00011 0.0001 0.00009
## Cumulative Proportion  0.99866 0.99880 0.99893 0.99904 0.9991 0.99923
##                           PC62    PC63    PC64    PC65    PC66    PC67
## Standard deviation     0.11920 0.11037 0.09867 0.09513 0.09437 0.09016
## Proportion of Variance 0.00009 0.00007 0.00006 0.00005 0.00005 0.00005
## Cumulative Proportion  0.99931 0.99939 0.99945 0.99950 0.99955 0.99960
##                           PC68    PC69    PC70    PC71    PC72    PC73
## Standard deviation     0.08776 0.08368 0.07755 0.07694 0.07280 0.06549
## Proportion of Variance 0.00005 0.00004 0.00004 0.00004 0.00003 0.00003
## Cumulative Proportion  0.99965 0.99969 0.99973 0.99976 0.99979 0.99982
##                           PC74    PC75    PC76    PC77    PC78    PC79
## Standard deviation     0.06194 0.06145 0.05420 0.05184 0.04933 0.04641
## Proportion of Variance 0.00002 0.00002 0.00002 0.00002 0.00001 0.00001
## Cumulative Proportion  0.99984 0.99987 0.99988 0.99990 0.99992 0.99993
##                           PC80    PC81    PC82    PC83    PC84    PC85
## Standard deviation     0.04430 0.04222 0.04009 0.03499 0.03489 0.03129
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001
## Cumulative Proportion  0.99994 0.99995 0.99996 0.99997 0.99998 0.99998
##                           PC86    PC87    PC88    PC89   PC90    PC91
## Standard deviation     0.02695 0.02536 0.02481 0.02177 0.0201 0.01676
## Proportion of Variance 0.00000 0.00000 0.00000 0.00000 0.0000 0.00000
## Cumulative Proportion  0.99999 0.99999 0.99999 1.00000 1.0000 1.00000
##                             PC92
## Standard deviation     7.013e-16
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

## initial  value 4.024334 
## final  value 4.024334 
## converged

## initial  value 0.073393 
## final  value 0.073373 
## converged

## initial  value 0.073393 
## final  value 0.073373 
## converged
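The bag-mean reduction used in this part can be sketched on a small example. The data frame `toy` and its columns `label`, `bag`, `f1`, and `f2` are hypothetical; base R `aggregate` is used here only to keep the sketch self-contained.

```r
# Toy multiple-instance data (hypothetical): label, bag id, two features
toy <- data.frame(
  label = c(1, 1, 1, 0, 0),
  bag   = c(1, 1, 2, 3, 3),
  f1    = c(10, 20, 5, -4, -6),
  f2    = c(1, 3, 7, 0, 2)
)

# One row per bag: the mean of each feature over the bag's instances
bag_means <- aggregate(cbind(f1, f2) ~ label + bag, data = toy, FUN = mean)
print(bag_means)

# PCA on the reduced feature matrix, centered and scaled
pca_bags <- prcomp(bag_means[, c("f1", "f2")], center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained
cum_prop <- cumsum(pca_bags$sdev^2) / sum(pca_bags$sdev^2)
print(cum_prop)
```

On the musk table, assuming `V1` holds the label and `V2` the bag id as the `head()` output suggests, a data.table equivalent would be `musk[, lapply(.SD, mean), by = .(V1, V2)]`.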

##################TASK 2#########################

In this part, image processing is carried out. First, an anime image is read and noise is added to it. After the color image is converted to grayscale, 3x3 patches are extracted and PCA is performed on them. The image can then be replotted using the first principal component.

## Standard deviations (1, .., p=9):
## [1] 0.82942284 0.22939386 0.19443172 0.14504610 0.13750839 0.07754772
## [7] 0.05479875 0.04775727 0.03548310
## 
## Rotation (n x k) = (9 x 9):
##              PC1         PC2          PC3        PC4        PC5
##  [1,] -0.3265218 -0.37642247  0.431648049 -0.2486016 -0.1936102
##  [2,] -0.3397919  0.04917718  0.422102838 -0.2128238  0.4693133
##  [3,] -0.3271763  0.42167753  0.366805668 -0.2538682 -0.2421919
##  [4,] -0.3317644 -0.42929957  0.034382258  0.4447482 -0.2728940
##  [5,] -0.3464645  0.01115158 -0.001670066  0.5129299  0.4528968
##  [6,] -0.3341491  0.41409718 -0.032337218  0.4426103 -0.2966252
##  [7,] -0.3247961 -0.43607100 -0.365410958 -0.2533659 -0.2235271
##  [8,] -0.3397860 -0.02681887 -0.425401246 -0.2128686  0.4686131
##  [9,] -0.3288963  0.36235295 -0.429232740 -0.2492888 -0.2155632
##                PC6           PC7         PC8        PC9
##  [1,]  0.507872321 -0.2794145018  0.31577278  0.1742490
##  [2,] -0.016400633  0.5639243309 -0.03382133 -0.3449346
##  [3,] -0.491288354 -0.3213517864 -0.28110962  0.1839497
##  [4,]  0.002250105 -0.0306776455 -0.56612608 -0.3327688
##  [5,] -0.010460559  0.0003755909  0.00199389  0.6414897
##  [6,]  0.003009552  0.0303605266  0.56521354 -0.3339680
##  [7,] -0.491683032  0.3204387420  0.28091771  0.1825317
##  [8,] -0.015060312 -0.5636411475  0.03367101 -0.3448142
##  [9,]  0.508261662  0.2800260049 -0.31639136  0.1755747
## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6
## Standard deviation     0.8294 0.22939 0.1944 0.14505 0.13751 0.07755
## Proportion of Variance 0.8280 0.06333 0.0455 0.02532 0.02276 0.00724
## Cumulative Proportion  0.8280 0.89131 0.9368 0.96213 0.98489 0.99213
##                            PC7     PC8     PC9
## Standard deviation     0.05480 0.04776 0.03548
## Proportion of Variance 0.00361 0.00275 0.00152
## Cumulative Proportion  0.99574 0.99848 1.00000
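The patch extraction and PCA described in this task can be sketched without an image file. The matrix `noisy` is a synthetic grayscale image, and the names `centers`, `patches`, and `pc1_img` are hypothetical. Leaving the patches unscaled is also an assumption, since the original call is not shown.

```r
set.seed(1)
# Toy grayscale "image" (hypothetical): smooth gradient plus uniform noise
h <- 20; w <- 24
img <- outer(seq(0, 1, length.out = h), seq(0, 1, length.out = w), "+") / 2
noisy <- img + matrix(runif(h * w, -0.1, 0.1), nrow = h, ncol = w)

# Each interior pixel is the center of one 3x3 patch, stored as a row of 9 values
rows <- 2:(h - 1)
cols <- 2:(w - 1)
centers <- expand.grid(i = rows, j = cols)  # i varies fastest
patches <- t(apply(centers, 1, function(p) {
  as.vector(noisy[(p["i"] - 1):(p["i"] + 1), (p["j"] - 1):(p["j"] + 1)])
}))

# PCA on the patch matrix (centered, not scaled: an assumption)
pca <- prcomp(patches, center = TRUE, scale. = FALSE)

# Scores on the first component, reshaped to the interior-pixel grid
pc1_img <- matrix(pca$x[, 1], nrow = length(rows), ncol = length(cols))

# Rescale to [0, 1] so the result can be displayed as a grayscale image
pc1_img <- (pc1_img - min(pc1_img)) / (max(pc1_img) - min(pc1_img))
print(dim(pc1_img))
```

The result can be displayed with `image(pc1_img, col = gray.colors(256))`. Because the sign of a principal component is arbitrary, the replotted image may appear inverted relative to the original.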